Skip to content

feat(operator): enforce RFC 1035 validation for TrainJob name#2767

Merged
google-oss-prow[bot] merged 3 commits into
kubeflow:masterfrom
juniemariam:feat/webhook-validate-trainjob-name
Aug 13, 2025
Merged

feat(operator): enforce RFC 1035 validation for TrainJob name#2767
google-oss-prow[bot] merged 3 commits into
kubeflow:masterfrom
juniemariam:feat/webhook-validate-trainjob-name

Conversation

@juniemariam

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

  • Adds validation to enforce RFC 1035 compliance for TrainJob.metadata.name
  • Checks for:
    • DNS-1035 format (lowercase, alphanumeric, -, etc.)
    • Max length of 63 characters

Why this is needed

This is a follow-up to #2734. Previously, invalid TrainJob names caused JobSet webhook rejections during reconciliation. This PR ensures invalid names are caught earlier at the TrainJob admission webhook level, improving clarity and user experience.

Related Issues

Fixes #2735
Related to #2621, #2734

Tests

  • Added unit tests for valid, uppercase, and too-long names
  • Verified via go test ./pkg/webhooks/... and make test
  • Unit tests, Integration tests passed
  • E2E test for PyTorch runtime passed
  • E2E test for OpenMPI runtime failed due to timeout

@juniemariam juniemariam changed the title feat(webhook): enforce RFC 1035 validation for TrainJob name feat(operator): enforce RFC 1035 validation for TrainJob name Aug 2, 2025
@juniemariam juniemariam force-pushed the feat/webhook-validate-trainjob-name branch from 6d4897e to 45be2cc Compare August 2, 2025 06:16

@kannon92 kannon92 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if you should just add this validation to the api?

you could easily do this via a CEL expression and avoid having to add it to the webhooks.

@andreyvelich andreyvelich left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@andreyvelich

Copy link
Copy Markdown
Member

/retest

@juniemariam

juniemariam commented Aug 5, 2025

Copy link
Copy Markdown
Contributor Author

Thanks for the suggestion @kannon92 — that’s a great point! I followed the pattern from PRs like #2734, where similar validations used webhooks. I'm still getting familiar with the project, so if CEL would be a better fit, I’d really appreciate any guidance and happy to update the PR if needed.

@kannon92

kannon92 commented Aug 6, 2025

Copy link
Copy Markdown
Contributor

Thanks for the suggestion @kannon92 — that’s a great point! I followed the pattern from PRs like #2734, where similar validations used webhooks. I'm still getting familiar with the project, so if CEL would be a better fit, I’d really appreciate any guidance and happy to update the PR if needed.

Hmm. I’ll leave that to the maintainers. I like to follow similar patterns as the other code.

cc @tenzen-y @andreyvelich

@astefanutti

Copy link
Copy Markdown
Contributor

@juniemariam you can find some information about CEL validation rules in https://kubernetes.io/blog/2022/09/23/crd-validation-rules-beta/ and the documentation at
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules

The idea is to specify a CEL rule directly in the Golang API using the +kubebuilder:validation:XValidation:rule marker.

There is also the format library that conveniently exposes the dns1035Label function accessible from CEL rules: https://kubernetes.io/docs/reference/using-api/cel/#kubernetes-format-library

@coveralls

coveralls commented Aug 8, 2025

Copy link
Copy Markdown

Pull Request Test Coverage Report for Build 16923747863

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.2%) to 47.791%

Totals Coverage Status
Change from base Build 16882514285: -0.2%
Covered Lines: 941
Relevant Lines: 1969

💛 - Coveralls

@juniemariam

Copy link
Copy Markdown
Contributor Author

Update: Switched from webhook-based validation to CEL-based validation for TrainJob.metadata.name.

  • Added CEL validation rules to enforce RFC 1035 compliance
  • Removed webhook logic for name validation
  • Regenerated CRDs with make manifests and committed changes

Verified with:

  • kubectl apply (valid + invalid test cases)
  • make test (unit tests pass)
  • also ran integration and E2E test cases (passed)

Let me know if further changes are needed!
/cc @andreyvelich @astefanutti

@google-oss-prow google-oss-prow Bot requested a review from andreyvelich August 9, 2025 00:19
@kannon92

Copy link
Copy Markdown
Contributor

Looking really good!

It would be nice to see some integration tests for this logic in this code

// +kubebuilder:storageversion
// +kubebuilder:printcolumn:name="State",type=string,JSONPath=`.status.conditions[-1:].type`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
// +kubebuilder:validation:XValidation:rule="self.metadata.name.matches('^[a-z]([-a-z0-9]*[a-z0-9])?$')", message="metadata.name must match RFC 1035 DNS label format"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would using !format.dns1123Label().validate(self.metadata.name).hasValue() be slightly better?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion @astefanutti ! I initially tried using the format library expression as recommended in the Kubernetes docs (e.g., !format.dns1035Label().validate(self.metadata.name).hasValue()), but it failed to pass the CEL cost estimation and could not be applied due to cost budget errors in the CRD. Hence I changed to regex check

Please let me know if you'd like me to try any other alternatives or if this existing approach works for now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juniemariam thanks for the explanation. That looks good to me as it is.

wantError: nil,
wantWarnings: nil,
},
"invalid trainjob name with uppercase letters": {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had assumed this test would/could be kept?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti Just to clarify—now that the validation is enforced in the CRD using CEL, do we still need to keep the Go/webhook logic and its related unit tests?

I moved the tests out of the webhook layer because the validation logic now lives in the CRD, so the webhook no longer performs this check.

Could you please advise whether it’s better to keep both the Go logic and tests (alongside the CEL rule and integration test), or just rely on CEL + integration tests?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juniemariam that makes sense. Yes an integration test could be useful.

Signed-off-by: Junie <juniemariam@gmail.com>
…Job name

Signed-off-by: Junie <juniemariam@gmail.com>
@juniemariam juniemariam force-pushed the feat/webhook-validate-trainjob-name branch from e4487b4 to 8a0b2ca Compare August 12, 2025 23:57
@astefanutti

Copy link
Copy Markdown
Contributor

/ok-to-test

@astefanutti

Copy link
Copy Markdown
Contributor

/lgtm

Thanks @juniemariam!

@tenzen-y tenzen-y left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
/lgtm
/approve

@google-oss-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit 952f943 into kubeflow:master Aug 13, 2025
20 checks passed
@google-oss-prow google-oss-prow Bot added this to the v2.1 milestone Aug 13, 2025
alexxfan pushed a commit to red-hat-data-services/trainer that referenced this pull request Nov 24, 2025
…ow#2767)

* feat(webhook): enforce RFC 1035 validation for TrainJob name

Signed-off-by: Junie <juniemariam@gmail.com>

* feat(operator): enforce RFC 1035 validation using CEL rules for TrainJob name

Signed-off-by: Junie <juniemariam@gmail.com>

* test(integration): added integration test for TrainJob name validation

Signed-off-by: Junie <juniemariam@gmail.com>

---------

Signed-off-by: Junie <juniemariam@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhanced validation of the name of TrainJob

6 participants